This document is a guide to the implementation of true online emphatic TD($\lambda$),
a model-free temporal-difference algorithm for learning to make long-term predictions
which combines the emphasis idea (Sutton, Mahmood & White 2015) and the
true-online idea (van Seijen & Sutton 2014). The setting used here includes linear
function approximation, the possibility of off-policy training, and all the generality
of general value functions (Maei & Sutton 2010), as well as the emphasis algorithm’s
notion of “interest”. Conventional TD($\lambda$) is of course the core model-free algorithm
for learning value functions in reinforcement learning (Sutton 1988, Sutton & Barto
1998). The emphasis idea is to dynamically rescale the updates made by
temporal-difference algorithms such that convergence is ensured under off-policy
training (Yu 2015) and such that asymptotic accuracy of the approximation is improved.
The true-online idea extends TD($\lambda$) to make it more data efficient and less
sensitive to step-size settings, at minimal computational expense (van Seijen, Mahmood,
Pilarski & Sutton 2015). The way that these ideas have been combined to produce
true online emphatic TD($\lambda$) was modelled after how van Hasselt, Mahmood, and
Sutton (2014) combined the true-online idea and the gradient-TD idea (Maei 2011,
Sutton et al. 2009) to produce true online GTD($\lambda$).


1 Setting and requirements 

We consider the setting of general value functions, or GVFs (Maei & Sutton 2010,
Sutton et al. 2011, White 2015, Sutton, Mahmood & White 2015). Here we present
these ideas without assuming access to an underlying state (as in Modayil, White
& Sutton 2014).

The algorithm is meant to be called at regular intervals with data from a time
series, from which it learns to make a prediction. The time series includes a feature
vector $\phi_t \in \mathbb{R}^n$ and a cumulant signal $R_t \in \mathbb{R}$.

The prediction at each time is linear in the feature vector. That is, the prediction
at time $t \ge 0$ is of the form

$$\theta_t^\top \phi_t = \sum_{i=1}^{n} \theta_t(i)\,\phi_t(i),$$

where $\theta_t \in \mathbb{R}^n$ is a learned weight vector at time $t$, and $\phi_t(i)$
and $\theta_t(i)$ are of course the $i$th components of the corresponding vectors. The
learning process results in the prediction at each time $t$ coming to approximate the
outcome, or target, that would follow it:

$$\sum_{k=t+1}^{\infty} R_k \prod_{j=t+1}^{k-1} \gamma_j,$$

if actions were selected according to policy $\pi$, and where $\gamma_t \in [0,1]$ is a
sequence of discount factors. We see from this equation why the signal $R_t$ is termed
the “cumulant”; all of its values are added up, or accumulated, within the temporal
envelope specified by the $\gamma_j$. In the special case in which the cumulant is a
reward and the $\gamma_j$ are constant, the GVF reduces to a conventional value function
from reinforcement learning.
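
The target above can be illustrated on a finite series. The following is only a
sketch under stated assumptions: the function name `gvf_target` and the list layout
(index 0 is unused padding so that list positions match the time subscripts) are
hypothetical, and truncating the infinite sum is valid here only because the last
discount factor in the example data is zero.

```python
def gvf_target(cumulants, gammas, t):
    """Sum of R_k * prod_{j=t+1}^{k-1} gamma_j for k = t+1 .. end of the lists.

    cumulants[k] holds R_k and gammas[j] holds gamma_j (hypothetical layout).
    """
    total = 0.0
    envelope = 1.0  # empty product, which applies to the first term k = t+1
    for k in range(t + 1, len(cumulants)):
        total += envelope * cumulants[k]
        envelope *= gammas[k]  # the product now runs up to j = k
    return total


# Hypothetical data: R_1..R_4 = 1, gamma_1..gamma_3 = 0.5, gamma_4 = 0.
cumulants = [None, 1.0, 1.0, 1.0, 1.0]
gammas = [None, 0.5, 0.5, 0.5, 0.0]
print(gvf_target(cumulants, gammas, 0))  # 1 + 0.5 + 0.25 + 0.125 = 1.875
```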

To make the GVF problem well defined, the user must provide $\pi$ and the $\gamma_j$.
The policy $\pi$ is not provided directly, but in the form of a sequence of importance
sampling ratios

$$\rho_t = \frac{\pi(A_t|S_t)}{\mu(A_t|S_t)},$$

where $S_t$ and $A_t$ are the state and the action actually taken at time $t$, and
$\pi(A_t|S_t)$ and $\mu(A_t|S_t)$ are the probabilities of $A_t$ in $S_t$ under policies
$\pi$ and $\mu$ respectively. The policy $\pi$ is called the target policy, because it is
the policy under which we are trying to predict the outcome, as stated above, and $\mu$
is called the behavior policy, because it actually generates the behavior and the time
series. Because only the ratio of the two probabilities is required, there is often no
need to work directly with states or action probabilities. For example, in the
on-policy case the target and behavior policies are the same, and the ratio is always
one. The discount factors are often taken to be constant, but are allowed to depend
arbitrarily on the time series, as long as $\prod_{j=t+1}^{\infty} \gamma_j = 0$.
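
As a small illustration of the ratio, the sketch below computes $\rho_t$ from the
two action probabilities. The helper name `importance_ratio` and the two-action
policies are hypothetical and chosen only for the example; in practice the ratio may
be available without explicit probabilities, as noted above.

```python
def importance_ratio(target_prob, behavior_prob):
    """rho_t = pi(A_t|S_t) / mu(A_t|S_t) for the action actually taken."""
    if behavior_prob <= 0.0:
        # An action that was actually taken must have nonzero behavior probability.
        raise ValueError("behavior probability of the taken action must be positive")
    return target_prob / behavior_prob


# Hypothetical two-action case: the target policy always picks action 0,
# while the behavior policy picks uniformly at random.
pi = {0: 1.0, 1: 0.0}
mu = {0: 0.5, 1: 0.5}
rho_taken_0 = importance_ratio(pi[0], mu[0])    # 2.0
rho_taken_1 = importance_ratio(pi[1], mu[1])    # 0.0
rho_on_policy = importance_ratio(mu[0], mu[0])  # 1.0 when target equals behavior
```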

In some publications concerning general value functions there is also specified a
fourth sequence pertaining to the prediction problem, the “terminal pseudo reward”
$Z_t$, which specifies a final signal to be added in with the cumulants at termination.
More recently it has been recognized that this functionality can be included with just
the cumulant $R_t$ by appropriately setting the discount sequence $\gamma_t$ (see
Modayil, White & Sutton 2014). For example, if one wanted a terminal pseudo reward of
$Z_t$ only upon termination, then one would use a cumulant of $R_t = (1 - \gamma_t) Z_t$.
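
The following sketch works through that construction on a hypothetical three-step
episode in which $\gamma_t = 1$ until termination and $\gamma_t = 0$ at termination.
The names and data are assumptions made for illustration; the point is only that the
accumulated target equals the value of $Z$ at the terminating step.

```python
def cumulant_from_terminal_signal(z, gamma):
    """R_t = (1 - gamma_t) * Z_t: nonzero only where gamma_t < 1."""
    return (1.0 - gamma) * z


# Hypothetical series for times 1..3 (index 0 is unused padding).
zs = [None, 3.0, 5.0, 7.0]
gammas = [None, 1.0, 1.0, 0.0]  # termination at time 3
cumulants = [None] + [cumulant_from_terminal_signal(zs[k], gammas[k]) for k in (1, 2, 3)]

# Accumulate the target from t = 0: sum_k R_k * prod_{j=1}^{k-1} gamma_j.
target, envelope = 0.0, 1.0
for k in (1, 2, 3):
    target += envelope * cumulants[k]
    envelope *= gammas[k]
print(cumulants[1:], target)  # [0.0, 0.0, 7.0] 7.0
```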

In addition to the time series of the feature vectors and cumulant signals, the
user must provide three sequences characterizing the nature of the approximation
to be found by the algorithm:

• $I_t \ge 0$; the interest sequence specifies the interest in or importance of
accurately predicting at time $t \ge 0$. For example, in episodic problems one may
care only about the value of the first state of the episode; this is specified by
setting $I_t = 1$ for the first state of each episode and $I_t = 0$ at all other times.
(Or, as suggested by the work of Thomas (2014), one may want to use $I_t = \gamma^t$,
where $t$ here is the time since the beginning of the episode.) In a discounted
continuing task, on the other hand, one often cares about all the states equally,
which is specified by setting $I_t = 1$ for all $t$. In general, if one has any reason
to be more concerned with the approximation being more accurate at some
times than others, this can be expressed through the interest sequence.

• $\lambda_t \in [0, 1]$; the bootstrapping sequence specifies the degree of
bootstrapping at each time.

• $\alpha_t \ge 0$; the step-size sequence specifies the size of the step at each time.
One common choice is a constant step-size parameter, e.g., $\alpha_t = 0.1/\max[\ldots]$.
Another common choice is a step-size parameter that decreases to zero slowly
over time. More sophisticated step-size adaptation methods could also be
used to determine the step-size sequence (e.g., Mahmood et al. 2012, Dabney
& Barto 2012, Riedmiller & Braun 1993).
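
The three interest choices mentioned in the first bullet can be written as small
functions of the time since the start of the episode. This is a sketch only; the
function names and the four-step episode are hypothetical.

```python
def interest_first_state_only(steps_since_start):
    """I_t = 1 at the first state of an episode and 0 at all other times."""
    return 1.0 if steps_since_start == 0 else 0.0


def interest_discounted(steps_since_start, gamma):
    """I_t = gamma^t, with t counted from the beginning of the episode."""
    return gamma ** steps_since_start


def interest_uniform(steps_since_start):
    """I_t = 1 for all t, as in a discounted continuing task."""
    return 1.0


episode_length = 4  # hypothetical
first_only = [interest_first_state_only(t) for t in range(episode_length)]
discounted = [interest_discounted(t, 0.5) for t in range(episode_length)]
uniform = [interest_uniform(t) for t in range(episode_length)]
print(first_only, discounted, uniform)
```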


2 Algorithm Specification 

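Internal to the learning algorithm are the learned weight vector, $\theta_t \in \mathbb{R}^n$,
and an auxiliary shorter-term-memory vector $e_t \in \mathbb{R}^n$ with $e_t \ge 0$. In
addition, there are the scalars $M_t \ge 0$ and $F_t \ge 0$. The emphasis $M_t$ and the TD
error $\delta_t$ are purely temporary variables. The true online emphatic TD($\lambda$)
algorithm is fully specified by the following equations:

$$\delta_t = R_{t+1} + \gamma_{t+1}\theta_t^\top\phi_{t+1} - \theta_t^\top\phi_t \tag{1}$$

$$F_t = \rho_{t-1}\gamma_t F_{t-1} + I_t, \quad \text{with } F_{-1} = 0 \tag{2}$$

$$M_t = \lambda_t I_t + (1 - \lambda_t) F_t \tag{3}$$

$$e_t = \rho_t\gamma_t\lambda_t e_{t-1} + \rho_t\alpha_t M_t \left(1 - \rho_t\gamma_t\lambda_t\phi_t^\top e_{t-1}\right)\phi_t, \quad \text{with } e_{-1} = 0 \tag{4}$$

$$\theta_{t+1} = \theta_t + \delta_t e_t + \left(e_t - \alpha_t M_t \rho_t \phi_t\right)\left(\theta_t - \theta_{t-1}\right)^\top\phi_t \tag{5}$$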

3 Pseudocode 

The following pseudocode characterizes the algorithm and its efficient implementation
in C++. First the init function should be called with argument n (the number
of components of $\theta$ and $\phi$):
init(n):
    store n
    e ← 0
    θ ← 0 (or arbitrary)
    F ← D ← γ ← 0


On each step, $t = 0, 1, 2, \ldots$, the learn function is called with arguments
$\alpha_t, I_t, \lambda_t, \phi_t, \rho_t, R_{t+1}, \phi_{t+1}, \gamma_{t+1}$:


learn(α, I, λ, φ, ρ, R, φ′, γ′):     ; α thru ρ are at t, the rest are at t+1
    δ ← R + γ′θᵀφ′ − θᵀφ             ; or, do all 3 inner products in a single loop
    F ← F + I                        ; F was ρ_{t−1} γ_t F_{t−1}, now it is F_t
    M ← λI + (1 − λ)F
    S ← ραM(1 − ργλφᵀe)              ; scalar S saves computation
    e ← ργλe + Sφ                    ; this + next 3 lines can be done in a single loop
    Δ ← δe + D(e − ραMφ)             ; D here is (θ_t − θ_{t−1})ᵀφ_t
    θ ← θ + Δ
    D ← Δᵀφ′
    F ← ργ′F
    γ ← γ′                           ; γ_{t+1} is kept for use as γ on the next step
Finally, to obtain a prediction based on the learned weights, pass a feature vector
to the predict function:


predict(φ):
    return θᵀφ
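
The three functions can be sketched in Python as a small class. This is an
illustrative transcription under stated assumptions, not one of the implementations
referred to in Section 4: the class name `TrueOnlineEmphaticTD`, the argument names,
and the use of plain Python lists are all hypothetical, and the discount factor passed
on one call is assumed to be stored for use as γ on the next call.

```python
class TrueOnlineEmphaticTD:
    """Pure-Python sketch of init, learn and predict (names are hypothetical)."""

    def __init__(self, n):
        self.n = n
        self.e = [0.0] * n       # auxiliary memory vector e
        self.theta = [0.0] * n   # learned weight vector
        self.F = 0.0             # between calls holds rho_{t-1} * gamma_t * F_{t-1}
        self.D = 0.0             # (theta_t - theta_{t-1})^T phi_t
        self.gamma = 0.0         # gamma_t, stored by the previous call

    @staticmethod
    def _dot(x, y):
        return sum(a * b for a, b in zip(x, y))

    def learn(self, alpha, interest, lam, phi, rho, reward, phi_next, gamma_next):
        theta, e = self.theta, self.e
        delta = reward + gamma_next * self._dot(theta, phi_next) - self._dot(theta, phi)
        self.F += interest                           # now F_t
        M = lam * interest + (1.0 - lam) * self.F    # emphasis M_t
        decay = rho * self.gamma * lam
        S = rho * alpha * M * (1.0 - decay * self._dot(phi, e))  # uses e_{t-1}
        d_next = 0.0
        for i in range(self.n):                      # single loop over components
            e[i] = decay * e[i] + S * phi[i]
            step = delta * e[i] + self.D * (e[i] - rho * alpha * M * phi[i])
            theta[i] += step
            d_next += step * phi_next[i]
        self.D = d_next
        self.F = rho * gamma_next * self.F
        self.gamma = gamma_next

    def predict(self, phi):
        return self._dot(self.theta, phi)


# Hypothetical one-feature, on-policy step with lambda = 0 and a cumulant of 1.
learner = TrueOnlineEmphaticTD(n=1)
learner.learn(alpha=0.5, interest=1.0, lam=0.0, phi=[1.0], rho=1.0,
              reward=1.0, phi_next=[1.0], gamma_next=0.0)
first_prediction = learner.predict([1.0])
print(first_prediction)  # 0.5
```

With λ = 0 and ρ = 1, each update in this example reduces to adding α M δ φ to the
weights, which makes the result easy to check by hand.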

If the task is episodic in the classical sense, then the terminal state should be
represented as a special additional state at which γ = 0, φ = 0, and with outgoing
transitions to the distribution of start states. As far as learn is concerned, there is
still just a single sequence.


4 Code 

Implementations that closely follow the pseudocode are provided for various
programming languages in separate files. Where we have seen it as convenient and
non-obfuscating, the implementations are in an object-oriented style in which one
creates an instance of the algorithm that contains all of its internal variables.